Data exploration is not only about computing summary statistics. Sometimes a well-chosen plot reveals more about the data than a table of numbers does. In this exercise, we apply what we’ve just learned about plotting in R, and in ggplot2 in particular. This time we’re going to use all of the gapminder GDP data!

1

Load the gapminder GDP data in the long format as in the Summary Statistics exercise. Make sure not to exclude the time period between 1970 and 2001.
Remember that we used the filter() function to select the individual time periods; leave that filter out this time.
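# Assumes the packages from the earlier exercises are installed
library(dplyr)
library(tidyr)
library(ggplot2)

gapminder_ggplot_input <- 
  readxl::read_excel(
    path = "../data/gapminder/GDPpercapitaconstant2000US.xlsx",
    sheet = "Data"
    ) %>% 
  rename(country = `Income per person (fixed 2000 US$)`) %>%
  # reshape from wide (one column per year) to long (one row per country and year)
  gather(-country, key = "year", value = "GDP") %>% 
  filter(!is.na(GDP)) %>% 
  arrange(year, GDP) %>%
  # average GDP over all countries for each year
  group_by(year) %>% 
  summarise(GDP_over_all_countries = mean(GDP))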
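To see what the reshaping and averaging steps do without the Excel file at hand, here is a minimal sketch on a made-up wide table. The tibble toy_wide and its values are invented for illustration and are not gapminder data; the pipeline itself mirrors the one above.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per country, one column per year
toy_wide <- tibble(
  country = c("A", "B", "C"),
  `1960`  = c(100, 200, NA),
  `1961`  = c(110, 210, 310)
)

toy_long <- toy_wide %>%
  gather(-country, key = "year", value = "GDP") %>%  # wide -> long
  filter(!is.na(GDP)) %>%                            # drop missing values
  group_by(year) %>%
  summarise(GDP_over_all_countries = mean(GDP))      # one row per year

toy_long
```

Note that year ends up as a character column, because its values are taken from the former column names. This matters for the plotting steps that follow. In recent tidyr versions, pivot_longer() is the recommended successor to gather() and could be used for the same reshaping.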

Previously, we have only analyzed how the period 1960-1969 compares to the period 2002-2011. The nice thing about plots is that we can make use of the whole range of years and still identify differences between various periods. Our plot of choice is therefore a line plot, which shows the data as a time series.

2

Plot the gapminder data as a line plot to get a time series.
Instead of geom_point() as in the slides, the geom you need is geom_line(). Moreover, in the aesthetics definition aes() you should set group = 1: after reshaping, year is stored as text, so ggplot otherwise treats every year as its own group and tries to draw one line per year.
ggplot(
  data = gapminder_ggplot_input,
  aes(x = year, y = GDP_over_all_countries, group = 1)
) +
  geom_line()
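The effect of group = 1 can be checked on a small made-up data frame. The sketch below assumes, as in the data above, that year is a character column; toy_years and the plot object names are hypothetical. It uses ggplot2's layer_data(), which returns the data computed for a layer, including the group each row was assigned to.

```r
library(ggplot2)

# Hypothetical yearly averages with year stored as text
toy_years <- data.frame(
  year = c("1960", "1961", "1962"),
  GDP_over_all_countries = c(150, 210, 260)
)

p_ungrouped <- ggplot(toy_years, aes(x = year, y = GDP_over_all_countries)) +
  geom_line()
p_grouped <- ggplot(toy_years, aes(x = year, y = GDP_over_all_countries, group = 1)) +
  geom_line()

# Number of distinct groups ggplot2 assigned in each plot
groups_ungrouped <- length(unique(layer_data(p_ungrouped)$group))
groups_grouped   <- length(unique(layer_data(p_grouped)$group))
```

In the first plot there should be one group per year, so no line can be drawn between the points; in the second plot all rows share a single group and are connected.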

Admittedly, this may not be the best approach to identify differences between the periods directly. We don’t know when our periods start and when they end. Luckily, this can be fixed using at least two approaches. Let’s start with the first one: using colors for different periods. For this purpose, we need an indicator variable that serves as a grouping variable and assigns a different color to the line in each period.

3

Create an indicator variable for the time periods 1960-1969, 2002-2011 and the time in between.
A combination of mutate() and if_else() lets you create new variables rather easily. Moreover, to get sensible legend labels later, define the values as strings.
gapminder_ggplot_input <-
  gapminder_ggplot_input %>% 
  mutate(
    period = 
      if_else(
        year >= 1960 & year <= 1969, 
        "1960-1969",
        if_else(
          year >= 2002 & year <= 2011, 
          "2002-2011", 
          "1970-2001")
      )
  )
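Nested if_else() calls become harder to read as the number of periods grows. As an alternative sketch, not the approach used in this exercise, dplyr's case_when() expresses the same three-way split as a flat list of conditions. The example uses a made-up tibble toy_years and, as an assumption on my part, first makes an integer copy of year, since year is stored as text and the comparison above relies on R comparing the year strings with each other.

```r
library(dplyr)

# Hypothetical years, stored as text like in the reshaped gapminder data
toy_years <- tibble(year = c("1960", "1969", "1970", "2001", "2002", "2011"))

toy_periods <- toy_years %>%
  mutate(
    year_num = as.integer(year),  # explicit numeric copy for the comparisons
    period = case_when(
      year_num >= 1960 & year_num <= 1969 ~ "1960-1969",
      year_num >= 2002 & year_num <= 2011 ~ "2002-2011",
      TRUE                                ~ "1970-2001"  # everything else
    )
  )

toy_periods
```

The conditions are evaluated in order, and the final TRUE branch plays the role of the innermost "else" in the nested version.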

After we’re set up with our indicator variable, it’s plotting time again. We can simply re-use our code from before and define a grouping color in the aesthetics definition. Try it out!

4

Plot the line plot once again with different colors for the different time periods.
In the aesthetics definition aes(), you can choose the option color = indicator_variable to define the grouping.
ggplot(
  data = gapminder_ggplot_input,
  aes(
    x = year, 
    y = GDP_over_all_countries, 
    color = period, 
    group = 1
  )
) +
  geom_line()

Now we can see some visual differences between the different periods. One last issue, however, is that there are far too many labels on the x-axis. A more sensible approach is to place an axis break every ten years.

5

Create some prettier, i.e., more sensible breaks for the x-axis.
You can modify the x-axis with scale_x_discrete() and its breaks with the option breaks = breaks_vector.
ggplot(
  data = gapminder_ggplot_input,
  aes(
    x = year, 
    y = GDP_over_all_countries, 
    color = period, 
    group = 1
  )
) +
  geom_line() +
  scale_x_discrete(
    breaks = seq(
      from = 1960, 
      to = 2011,
      by = 10
    )
  )
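It can help to inspect the breaks vector on its own before handing it to the scale. The sketch below only uses base R. Converting the breaks to strings with as.character() is an optional variant, added here as an assumption, that makes them match the text-valued year column explicitly; breaks_numeric and breaks_vector are just illustrative names.

```r
# Every tenth year, starting at 1960 and not exceeding 2011
breaks_numeric <- seq(from = 1960, to = 2011, by = 10)
breaks_numeric
# 1960 1970 1980 1990 2000 2010

# The same breaks as strings, matching a character-valued year column
breaks_vector <- as.character(breaks_numeric)
```

breaks_vector could then be passed as scale_x_discrete(breaks = breaks_vector). Note that the last break is 2010 rather than 2011, because the sequence advances in steps of ten from 1960.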